W1: Database Concepts, DESCRIBE, SELECT, WHERE

Welcome!

Please sign-up for an account at Posit Cloud and accept our classroom invitation here: https://posit.cloud/spaces/689711/join?access_code=8kse5IYlL4kHIqZvKaQ6mXp8IMibFayMa10I8Izn

Our course website: https://intro-sql-fh.netlify.app/

Introductions

  • Who am I?
  • Who are you?

    • Name, pronouns, group you work in

    • What you want to get out of the class

    • What has brought you joy lately?

  • Our wonderful TAs!

Goals of the course

  • Fundamentals of SQL query writing: filtering, joining, grouping.
  • Not so much about building your own database and optimizing it.

Content of the course

  1. Database Concepts, DESCRIBE, SELECT, WHERE

  2. JOINing tables

  3. [No class week]

  4. Grouping and Aggregating variables

  5. Subqueries, Views, Pizza

Culture of the course

  • Challenge: We are learning a new language, but you already have a full-time job.
  • Teach not for mastery, but teach for empowerment to learn effectively.
  • Teach at learner’s pace.

Culture of the course

  • Challenge: We sometimes struggle with our data science problems in isolation, unaware that other folks are working on similar things.
  • We learn and work better with our peers.
  • We encourage discussion and questions, as others often have similar questions also.

Format of the course

  • Hybrid, and recordings will be available.
  • 1 hour exercises after each session are encouraged for practice.
  • Office hours 11:30-Noon before class.

Badge of completion

We offer a badge of completion when you finish the course!

What it is:

  • A display of what you accomplished in the course, shareable in your professional networks such as LinkedIn, similar to online education services such as Coursera.

What it isn’t:

  • Accreditation through an university or degree-granting program.

Requirements:

  • Complete badge-required sections of the exercises for 3 out of 4 assignments.

Databases…

  • What are some Databases you are interested in?
  • Why do we need a Database Management System (DBMS) to manage it? (What could go wrong in managing a spreadsheet?)

Benefits of a DBMS:

  • Data Integrity: What are the rules within the database? If it is a medical database, does a patient always have a visit site? How do we deal with missing data? Are duplicated entries allowed?
  • Implementation: How do you find a particular record? What if we now want to create a new application that uses the same database? What if that application is running on a different machine?
  • Durability: What if the machine crashes while our program is updating a record? What if we want to replicate the database on multiple machines?

Database Management System (DBMS) consists of

  • A user interface - how users interact with the database. In this class, our main way of interacting with databases is SQL (Structured Query Language).

  • An execution engine - a software system that queries the data in storage. These can live on our machine, on a server within our network, or a server on the cloud.

  • Data Storage - the physical location where the data is stored.

DBMS examples

This class Example Hutch on-site database system Example Hutch cloud database system
User Interface SQL SQL SQL
Execution Engine DuckDB SQL Server Databrick/Snowflake
Data Storage File on our machine FH Shared Storage Amazon S3 Bucket

Our underlying data model

Relational Database: Data is organized into multiple tables. Tables are connected via columns that share the same elements across tables.

Person table

person_id year_of_birth gender_source_value
001 1/1/1999 F
002 12/31/1999 F
003 6/1/2000 M

Procedure Occurrence table

procedure_occurrence_id person_id procedure_datetime
101 001 4/1/2010
102 003 6/1/2022
103 004 5/1/2001

Entity Relationship Diagram

A short survey on your interest and background

https://forms.gle/YADmDmukRKmGk2KFA

Let’s get started: connecting to the database

library(DBI)

con <- DBI::dbConnect(duckdb::duckdb(), "../data/GiBleed_5.3_1.1.duckdb")

What are the available tables?

SHOW TABLES
Displaying records 1 - 10
name
care_site
cdm_source
concept
concept_ancestor
concept_class
concept_relationship
concept_synonym
condition_era
condition_occurrence
cost

Describing a table

DESCRIBE person
Displaying records 1 - 10
column_name column_type null key default extra
person_id INTEGER YES NA NA NA
gender_concept_id INTEGER YES NA NA NA
year_of_birth INTEGER YES NA NA NA
month_of_birth INTEGER YES NA NA NA
day_of_birth INTEGER YES NA NA NA
birth_datetime TIMESTAMP YES NA NA NA
race_concept_id INTEGER YES NA NA NA
ethnicity_concept_id INTEGER YES NA NA NA
location_id INTEGER YES NA NA NA
provider_id INTEGER YES NA NA NA

Data Types

If you look at the column_type for one of the DESCRIBE statements above, you’ll notice there are different data types:

  • INTEGER
  • TIMESTAMP
  • DATE
  • VARCHAR

You can see all of the datatypes that are available in DuckDB here.

SELECT and FROM

SELECT is a clause that lets you pick out columns of interest. If you want all columns, use *.

FROM is a clause that lets you decide which table to work with.

SELECT * 
  FROM person 
  LIMIT 10;
Displaying records 1 - 10
person_id gender_concept_id year_of_birth month_of_birth day_of_birth birth_datetime race_concept_id ethnicity_concept_id location_id provider_id care_site_id person_source_value gender_source_value gender_source_concept_id race_source_value race_source_concept_id ethnicity_source_value ethnicity_source_concept_id
6 8532 1963 12 31 1963-12-31 8516 0 NA NA NA 001f4a87-70d0-435c-a4b9-1425f6928d33 F 0 black 0 west_indian 0
123 8507 1950 4 12 1950-04-12 8527 0 NA NA NA 052d9254-80e8-428f-b8b6-69518b0ef3f3 M 0 white 0 italian 0
129 8507 1974 10 7 1974-10-07 8527 0 NA NA NA 054d32d5-904f-4df4-846b-8c08d165b4e9 M 0 white 0 polish 0
16 8532 1971 10 13 1971-10-13 8527 0 NA NA NA 00444703-f2c9-45c9-a247-f6317a43a929 F 0 white 0 american 0
65 8532 1967 3 31 1967-03-31 8516 0 NA NA NA 02a3dad9-f9d5-42fb-8074-c16d45b4f5c8 F 0 black 0 dominican 0
74 8532 1972 1 5 1972-01-05 8527 0 NA NA NA 02fbf1be-29b7-4da8-8bbd-14c7433f843f F 0 white 0 english 0
42 8532 1909 11 2 1909-11-02 8527 0 NA NA NA 0177d2e0-98f5-4f3d-bcfd-497b7a07b3f8 F 0 white 0 irish 0
187 8507 1945 7 23 1945-07-23 8527 0 NA NA NA 07a1e14d-73ed-4d3a-9a39-d729745773fa M 0 white 0 irish 0
18 8532 1965 11 17 1965-11-17 8527 0 NA NA NA 0084b0fe-e30f-4930-b6d1-5e1eff4b7dea F 0 white 0 english 0
111 8532 1975 5 2 1975-05-02 8527 0 NA NA NA 0478d6b3-bdb3-4574-9b93-cf448d725b84 F 0 white 0 english 0

LIMIT n let’s you look at the first n entries.

We put multiple SQL clauses together to form a query.

Try it out yourself on procedure_occurrence table. Why is there a person_id column in this table as well?

SELECT for specific columns

Instead of * for all columns, we can specify the columns of interest:

SELECT person_id, birth_datetime, gender_concept_id 
  FROM person
  LIMIT 10;
Displaying records 1 - 10
person_id birth_datetime gender_concept_id
6 1963-12-31 8532
123 1950-04-12 8507
129 1974-10-07 8507
16 1971-10-13 8532
65 1967-03-31 8532
74 1972-01-05 8532
42 1909-11-02 8532
187 1945-07-23 8507
18 1965-11-17 8532
111 1975-05-02 8532

Try add race_concept_id and year_of_birth to your SELECT query.

WHERE - filtering our table

Adding WHERE to our SQL statement lets us add filtering to our query:

SELECT person_id, gender_source_value, race_source_value, year_of_birth 
  FROM person 
  WHERE year_of_birth < 2000
Displaying records 1 - 10
person_id gender_source_value race_source_value year_of_birth
6 F black 1963
123 M white 1950
129 M white 1974
16 F white 1971
65 F black 1967
74 F white 1972
42 F white 1909
187 M white 1945
18 F white 1965
111 F white 1975

You don’t need to include the columns you’re filtering via WHERE in the SELECT part of the statement:

SELECT person_id, gender_source_value, race_source_value 
  FROM person 
  WHERE year_of_birth < 2000
Displaying records 1 - 10
person_id gender_source_value race_source_value
6 F black
123 M white
129 M white
16 F white
65 F black
74 F white
42 F white
187 M white
18 F white
111 F white

Single quotes and WHERE

Single quotes (‘M’) refer to values, and double quotes refer to columns (“person_id”).

This will trip you up several times if you’re not used to it.

SELECT person_id, gender_source_value
  FROM person 
  WHERE gender_source_value = 'M'
  LIMIT 10;
Displaying records 1 - 10
person_id gender_source_value
123 M
129 M
187 M
40 M
53 M
78 M
69 M
248 M
105 M
49 M

COUNT - how many entries?

Sometimes you want to know the size of your result, not necessarily return the entire set of results. That is what COUNT is for.

SELECT COUNT(*)
  FROM procedure_occurrence;
1 records
count_star()
37409

Similarly, when we want to count the number of person_ids returned, we can use COUNT(person_id):

SELECT COUNT(procedure_concept_id)
  FROM procedure_occurrence;
1 records
count(procedure_concept_id)
37409

COUNT DISTINCT for unique entries

When you have repeated values, COUNT(DISTINCT ) can help you find the number of unique values in a column:

SELECT COUNT(DISTINCT procedure_concept_id)
  FROM procedure_occurrence
1 records
count(DISTINCT procedure_concept_id)
51

We can also return the actual DISTINCT values by removing COUNT:

SELECT DISTINCT procedure_concept_id
  FROM procedure_occurrence;
Displaying records 1 - 10
procedure_concept_id
4151422
4125906
44783196
4187458
4163872
4198190
4326177
4163951
40492359
4058899

Your turn: Count the distinct values of gender_source_value in person.

Revisiting DESCRIBE

One of the important properties of data in a relational database is that there are no repeat rows in the database. Each table that meets this restriction has what is called a primary key.

DESCRIBE person
Displaying records 1 - 10
column_name column_type null key default extra
person_id INTEGER YES NA NA NA
gender_concept_id INTEGER YES NA NA NA
year_of_birth INTEGER YES NA NA NA
month_of_birth INTEGER YES NA NA NA
day_of_birth INTEGER YES NA NA NA
birth_datetime TIMESTAMP YES NA NA NA
race_concept_id INTEGER YES NA NA NA
ethnicity_concept_id INTEGER YES NA NA NA
location_id INTEGER YES NA NA NA
provider_id INTEGER YES NA NA NA

We'll see that primary keys need to be unique (so they can map to each row).

Always close the connection

When we’re done, it’s best to close the connection with dbDisconnect().

dbDisconnect(con)